The semester project gives you the chance to apply the knowledge and skills you will learn in the course in a way that mimics the ways you can expect to use logistic regression modeling in a real-world setting.
For the project, you will choose one data set from the two listed below and perform a logistic regression analysis. Specifically, you will build a regression model and report on the model statistics and diagnostics. Your final deliverable for the project will be an R Markdown file. The file will contain the analyses detailed in each step with 5-7 written bullet points. Basically, pretend you are presenting to a non-technical audience, and the bullet points would serve as a script outline for your explanation. By non-technical I mean an audience who knows things like “a p-value less than 0.05 mean statistical significance” but cannot really explain the underlying concepts in great detail.
You can choose one of two different data sets to complete the project, either the wine quality data set first analyzed during week 3 or the data set diabetes from the faraway package, which we do not cover elsewhere in the course. Here are the details for each. Remember, you just need to choose one, not both.
wine quality: rather than building a model to discriminate between white and red wine, for the project you may collapse the quality variable into a binary response equal to 1 (high) if quality > 5 and 0 (low) if quality <= 5. Build a logistic regression model to explain what factors are related to a high rating.
diabetes: the help file for the data states that glycosolated hemoglobin (denoted as glyhbin the data) greater than 7.0 is usually taken as a positive diagnosis of diabetes. Thus, collapse glyhb into a binary response equal to 1 if gly > 7 (positive) and 0 if gly <= 7 (negative). Build a logistic regression model to explain what factors are related to a positive diagnosis.
library(tidyverse)
library(ROCR) # <-- for AUC
library(ResourceSelection) # <-- for Hosmer-Lemeshow Goodness of Fit Test
library(splines) # <-- for splines
I am chosing the wine quality dataset for my analysis.
wine_red <- read_delim('winequality-red.csv',delim = ';') %>%
mutate(quality=factor(ifelse(quality>5,1,0),labels = c('low','high')))
names(wine_red)<-make.names(names(wine_red),unique = TRUE)
The project consists of five steps, which you will work on throughout the second half of the semester. The steps you complete each week will accumulate in your R Markdown file, due at the end of the course. You will receive credit for completing drafts of Steps 1- 5 according to the schedule below. Each of these steps are described in greater detail in the week they are due.
Task: Construct interleaved histograms and scatterplots to explore the relation between the predictors and response. Specifically, choose two predictors and make an interleaved histogram and scatterplot for each. Thus, you should construct four total graphs.
ggplot(
data = wine_red,
aes(x=alcohol,fill=quality)
) + geom_histogram(position = 'dodge',bins = 30) +
ggtitle("Histogram of Wine Quality by Alcohol Percent")+
xlab("Alcohol %")
ggplot(
data = wine_red,
aes(x=total.sulfur.dioxide,fill=quality)
) + geom_histogram(position = 'dodge',bins = 30) +
ggtitle("Histogram of Wine Quality by Total Sulfur Dioxide") +
xlab("Total Sulfur Dioxide")
ggplot(
data = wine_red,
aes(x=total.sulfur.dioxide,y=alcohol,color=quality)
) + geom_point(alpha=.5) +
ggtitle("Histogram of Wine Quality by Total Sulfur Dioxide") +
xlab("Total Sulfur Dioxide")+
ylab("Alcohol %")
Task: Perform variable selection via the AIC using the step() function. Your starting model should include all available predictors. The reduced model should be used as your final model for all subsequent steps. Or, if you disagree with the recommended model, you need to indicate why.
wine_red <- wine_red %>%
mutate(quality=ifelse(quality=='high',1,0))
saturated_model <- glm(quality~.,family=binomial,data=wine_red)
aic_step <- step(saturated_model)
## Start: AIC=1679.63
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol
##
## Df Deviance AIC
## - pH 1 1655.9 1677.9
## - density 1 1656.0 1678.0
## - residual.sugar 1 1656.7 1678.7
## - fixed.acidity 1 1657.5 1679.5
## <none> 1655.6 1679.6
## - citric.acid 1 1660.8 1682.8
## - chlorides 1 1662.1 1684.1
## - free.sulfur.dioxide 1 1662.9 1684.9
## - total.sulfur.dioxide 1 1690.2 1712.2
## - sulphates 1 1698.8 1720.8
## - volatile.acidity 1 1705.9 1727.9
## - alcohol 1 1730.0 1752.0
##
## Step: AIC=1677.9
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + sulphates + alcohol
##
## Df Deviance AIC
## - density 1 1657.2 1677.2
## - residual.sugar 1 1657.5 1677.5
## <none> 1655.9 1677.9
## - citric.acid 1 1661.2 1681.2
## - chlorides 1 1662.1 1682.1
## - fixed.acidity 1 1662.8 1682.8
## - free.sulfur.dioxide 1 1663.0 1683.0
## - total.sulfur.dioxide 1 1690.7 1710.7
## - sulphates 1 1701.1 1721.1
## - volatile.acidity 1 1706.8 1726.8
## - alcohol 1 1751.2 1771.2
##
## Step: AIC=1677.19
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## sulphates + alcohol
##
## Df Deviance AIC
## - residual.sugar 1 1657.8 1675.8
## <none> 1657.2 1677.2
## - citric.acid 1 1662.5 1680.5
## - chlorides 1 1663.1 1681.1
## - fixed.acidity 1 1663.3 1681.3
## - free.sulfur.dioxide 1 1664.2 1682.2
## - total.sulfur.dioxide 1 1691.4 1709.4
## - sulphates 1 1701.1 1719.1
## - volatile.acidity 1 1713.3 1731.3
## - alcohol 1 1839.5 1857.5
##
## Step: AIC=1675.84
## quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + sulphates +
## alcohol
##
## Df Deviance AIC
## <none> 1657.8 1675.8
## - citric.acid 1 1663.0 1679.0
## - chlorides 1 1663.5 1679.5
## - fixed.acidity 1 1664.2 1680.2
## - free.sulfur.dioxide 1 1665.2 1681.2
## - total.sulfur.dioxide 1 1691.4 1707.4
## - sulphates 1 1701.2 1717.2
## - volatile.acidity 1 1713.4 1729.4
## - alcohol 1 1843.8 1859.8
formula(aic_step)
## quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + sulphates +
## alcohol
Task: Check that continuous predictors have a linear relation with the logit using loess plots and perform the Hosmer-Lemeshow (HL) goodness of fit test. If the loess plot is nonlinear, then splines should be used to account for the nonlinearity.
loess_plot <- function(column_name){
f <- as.formula(paste("quality", column_name, sep = " ~ "))
y_smooth <- predict(loess(f, data=wine_red))
zero_one <- which(y_smooth>0 & y_smooth<1)
plot(jitter(wine_red[[column_name]])[zero_one],log(y_smooth[zero_one]/(1-y_smooth[zero_one])),
xlab = column_name)
}
loess_plot('alcohol')
The lowess plot for Alcohol clearly shows around 2 splines around 9.5% and 12.5%
loess_plot('volatile.acidity')
The lowess plot for Volatile Acidity does not show a strong need for splines
loess_plot('sulphates')
The lowess plot for Sulphates shows a clear need for a spline around 1
loess_plot('total.sulfur.dioxide')
The lowess plot for Total Sulfur Dioxide shows a potential spline around 40 is needed
loess_plot('free.sulfur.dioxide')
The lowess plot for Free Sulfur Dioxide shows splines around 10 and 15 could improve the model
loess_plot('fixed.acidity')
The lowess plot for Fixed Acidity shows that splines around 8 and 12 are needed
loess_plot('chlorides')
The lowess plot for Chlorides shows a spline around .09 should be added
loess_plot('citric.acid')
The lowess plot for Citric Acid shows that splines around .2 and .5 should be added
spline_formula <- formula(quality~ns(citric.acid,4)+
ns(chlorides,2)+
ns(fixed.acidity,2)+
ns(free.sulfur.dioxide,4)+
ns(total.sulfur.dioxide,3)+
ns(sulphates,4)+
volatile.acidity+
ns(alcohol,4))
step_with_splines_model <- glm(spline_formula,family=binomial,data=wine_red)
hl_test <- hoslem.test(wine_red$quality,aic_step$fitted.values,10)
hl_test
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: wine_red$quality, aic_step$fitted.values
## X-squared = 16.205, df = 8, p-value = 0.03954
hl_test <- hoslem.test(wine_red$quality,step_with_splines_model$fitted.values,10)
hl_test
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: wine_red$quality, step_with_splines_model$fitted.values
## X-squared = 10.983, df = 8, p-value = 0.2026
Most of the Loess Plots show that the predictors are nonlinear except Volatile Acidity and thus splines need to be used to create multiple regression lines for each of the remaining predictors in the model. Another choice could be to select a model like a decision tree or neural network that can model features with nonlinear behaviors
Task: Report p-values and confidence intervals for significant predictors and check for influential observations. Any influential observations should be removed and the model should be refit. Note any changes in the inferences due to the removal.
summary(step_with_splines_model)
##
## Call:
## glm(formula = spline_formula, family = binomial, data = wine_red)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5819 -0.8058 0.2813 0.7730 2.4647
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.78309 1.28210 -0.611 0.54134
## ns(citric.acid, 4)1 -0.18255 0.30052 -0.607 0.54356
## ns(citric.acid, 4)2 -1.45620 0.51768 -2.813 0.00491 **
## ns(citric.acid, 4)3 -0.83535 0.80108 -1.043 0.29705
## ns(citric.acid, 4)4 0.69174 1.38556 0.499 0.61760
## ns(chlorides, 2)1 -1.83924 1.01853 -1.806 0.07095 .
## ns(chlorides, 2)2 -1.73198 1.39238 -1.244 0.21354
## ns(fixed.acidity, 2)1 3.07964 0.79158 3.891 0.00010 ***
## ns(fixed.acidity, 2)2 -0.15121 0.67374 -0.224 0.82242
## ns(free.sulfur.dioxide, 4)1 0.58704 0.55610 1.056 0.29114
## ns(free.sulfur.dioxide, 4)2 -0.02082 0.56275 -0.037 0.97048
## ns(free.sulfur.dioxide, 4)3 0.58175 1.23333 0.472 0.63715
## ns(free.sulfur.dioxide, 4)4 1.89145 1.10156 1.717 0.08597 .
## ns(total.sulfur.dioxide, 3)1 -2.35477 0.56903 -4.138 3.50e-05 ***
## ns(total.sulfur.dioxide, 3)2 -1.34006 1.07490 -1.247 0.21252
## ns(total.sulfur.dioxide, 3)3 -1.74398 1.43544 -1.215 0.22439
## ns(sulphates, 4)1 2.59607 0.79839 3.252 0.00115 **
## ns(sulphates, 4)2 2.67761 0.68466 3.911 9.20e-05 ***
## ns(sulphates, 4)3 4.35350 1.83747 2.369 0.01782 *
## ns(sulphates, 4)4 2.52875 1.09639 2.306 0.02109 *
## volatile.acidity -3.41227 0.51351 -6.645 3.03e-11 ***
## ns(alcohol, 4)1 -0.40314 0.68396 -0.589 0.55558
## ns(alcohol, 4)2 2.62487 0.65818 3.988 6.66e-05 ***
## ns(alcohol, 4)3 1.03219 1.61410 0.639 0.52251
## ns(alcohol, 4)4 3.54187 1.34294 2.637 0.00835 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2209.0 on 1598 degrees of freedom
## Residual deviance: 1583.7 on 1574 degrees of freedom
## AIC: 1633.7
##
## Number of Fisher Scoring iterations: 5
Task: Summarize predictive/discriminatory power of the model with an ROC curve and plots of predicted probabilities.
pred <- prediction(step_with_splines_model$fitted.values, wine_red$quality)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10), main="ROC Curve")
auc.tmp <- performance(pred,"auc");
auc <- as.numeric(auc.tmp@y.values)
cat(paste("The AUC for my selected model is:",round(auc,3)))
## The AUC for my selected model is: 0.839
plot(wine_red$fixed.acidity,step_with_splines_model$fitted.values)
plot(wine_red$volatile.acidity,step_with_splines_model$fitted.values)
plot(wine_red$citric.acid,step_with_splines_model$fitted.values)
plot(wine_red$chlorides,step_with_splines_model$fitted.values)
plot(wine_red$free.sulfur.dioxide,step_with_splines_model$fitted.values)
plot(wine_red$total.sulfur.dioxide,step_with_splines_model$fitted.values)
plot(wine_red$sulphates,step_with_splines_model$fitted.values)
plot(wine_red$alcohol,step_with_splines_model$fitted.values)
Comments on Variables Selected: